Skip to content

Conversation

@GhostCai
Copy link

This PR ensures the progress bar starts from 0 when resuming from a checkpoint, fixing an issue where it previously started at global_step and caused wrong progress tracking.

@bghira
Copy link
Contributor

bghira commented Oct 29, 2024

this isn't correct. it's supposed to resume from global_step

@GhostCai
Copy link
Author

this isn't correct. it's supposed to resume from global_step

Hi, thank you for the feedback! I believe there’s still an issue—the progress bar is updated within the resume-skipping section of the code (L855). To resolve this, we could either remove the progress bar update in that section or adjust its range accordingly.

This issue was already addressed in train_instruct_pix2pix_sdxl.py by setting the full range, but it seems the fix wasn’t applied here somehow.

@bghira
Copy link
Contributor

bghira commented Oct 29, 2024

the other scripts iirc are simply jumping straight to the resume range position, and then continue inside the loop is antiquated and should be removed instead

@bghira
Copy link
Contributor

bghira commented Oct 29, 2024

it's debated whether stepping the dataloader was needed, or if that's needed at all any longer, since the random states are resumed by accelerate. i believe before accelerate was as robust/full-featured, stepping the loop N times was needed to continue training in a reproducible manner.

@GhostCai
Copy link
Author

Exactly. Using continue in this way is safer but can lead to unnecessary data loading. In this code, both a shortened range and continue are used together, resulting in an incorrect range. For example, if I want to train for 20k steps and resume from an 8k checkpoint, the range is set to range(8k, 20k). However, after all the continue statements, the current iteration counter updates to 16k, leaving only 4k steps instead of the expected 12k for training. So I think we should adjust the range.

@bghira
Copy link
Contributor

bghira commented Oct 29, 2024

well, no - remove the continue statement, so that this script becomes consistent across the project's other script examples.

@GhostCai
Copy link
Author

OK that can solve the problem too. I’ve removed the continue statement and set the initial value to global_step instead.

@yiyixuxu yiyixuxu requested a review from sayakpaul November 6, 2024 00:44
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines -854 to -855
if step % args.gradient_accumulation_steps == 0:
progress_bar.update(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we have set initial=global_step in the progress_bar so it will be consistent with train_instruct_pix2pix_sdxl.py's progress_bar setup? As is proposed by bghira in here. If we don't remove this statement, the initial should be set to zero.

@sayakpaul
Copy link
Member

Can we try to harmonize the changes by following what we do here?

I think it's better to have better consistency that way.

@GhostCai
Copy link
Author

GhostCai commented Nov 7, 2024

Can we try to harmonize the changes by following what we do here?

I think it's better to have better consistency that way.

Yes, we can work towards harmonizing the changes. The main difference here lies in whether or not we skip the loop N times. This file does while the SDXL file doesn't. However, based on our previous discussion, skipping the loop N times in this implementation is needed for maintaining reproducibility in training. So, to ensure consistency, it may actually be the SDXL version that needs to be aligned with this approach. Or, if we believe accelerate is robust enough, we could remove the continue in this file to match the SDXL version. Thoughts?

@sayakpaul
Copy link
Member

Thanks for explaining further!

Could you please provide a visual example comparing the changes you would like to see if it's not too much?

@sayakpaul
Copy link
Member

Gentle ping @GhostCai

@github-actions
Copy link
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Dec 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale Issues that haven't received updates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants